Goto

Collaborating Authors

 Haeju



SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers

Manem, Chaitanya, Brahma, Pratik Prabhanjan, Mishra, Prakamya, Liu, Zicheng, Barsoum, Emad

arXiv.org Artificial Intelligence

The demand for Large Language Models (LLMs) at multiple scales, capable of sophisticated and sound mathematical reasoning, continues to grow. However, the development of performant mathematical LLMs is often bottlenecked by the scarcity of useful training data containing problems with significant complexity. We introduce \textbf{SAND-Math} (\textbf{S}ynthetic \textbf{A}ugmented \textbf{N}ovel and \textbf{D}ifficult Mathematics problems and solutions), a pipeline that addresses this by first synthesizing high-quality problems from scratch and then systematically elevating their complexity via a our newly proposed \textbf{Difficulty Hiking} step. We demonstrate the effectiveness of our approach through two key findings: \textbf{(1)} Augmenting a strong post-training baseline with a small 500-sample SAND-Math dataset significantly boosts performance, outperforming the next-best synthetic dataset by $\uparrow$ 17.85 absolute points on AIME25 benchmark. \textbf{(2)} In a dedicated ablation study, we show the effectiveness of our Difficulty Hiking process in increasing average problem difficulty from 5.02 to 5.98. This step consequently lifts AIME25 results from 46.38\% to 49.23\%. The full generation pipeline, final dataset, and a fine-tuned model form a practical and scalable toolkit for building capable and efficient mathematical reasoning LLMs.


PANORAMA: A Dataset and Benchmarks Capturing Decision Trails and Rationales in Patent Examination

Lim, Hyunseung, Nam, Sooyohn, Na, Sungmin, Cho, Ji Yong, Yang, June Yong, Shin, Hyungyu, Lee, Yoonjoo, Kim, Juho, Lee, Moontae, Hong, Hwajung

arXiv.org Artificial Intelligence

Patent examination remains an ongoing challenge in the NLP literature even after the advent of large language models (LLMs), as it requires an extensive yet nuanced human judgment on whether a submitted claim meets the statutory standards of novelty and non-obviousness against previously granted claims -- prior art -- in expert domains. Previous NLP studies have approached this challenge as a prediction task (e.g., forecasting grant outcomes) with high-level proxies such as similarity metrics or classifiers trained on historical labels. However, this approach often overlooks the step-by-step evaluations that examiners must make with profound information, including rationales for the decisions provided in office actions documents, which also makes it harder to measure the current state of techniques in patent review processes. To fill this gap, we construct PANORAMA, a dataset of 8,143 U.S. patent examination records that preserves the full decision trails, including original applications, all cited references, Non-Final Rejections, and Notices of Allowance. Also, PANORAMA decomposes the trails into sequential benchmarks that emulate patent professionals' patent review processes and allow researchers to examine large language models' capabilities at each step of them. Our findings indicate that, although LLMs are relatively effective at retrieving relevant prior art and pinpointing the pertinent paragraphs, they struggle to assess the novelty and non-obviousness of patent claims. We discuss these results and argue that advancing NLP, including LLMs, in the patent domain requires a deeper understanding of real-world patent examination. Our dataset is openly available at https://huggingface.co/datasets/LG-AI-Research/PANORAMA.


PRISM: Agentic Retrieval with LLMs for Multi-Hop Question Answering

Nahid, Md Mahadi Hasan, Rafiei, Davood

arXiv.org Artificial Intelligence

Retrieval plays a central role in multi-hop question answering (QA), where answering complex questions requires gathering multiple pieces of evidence. We introduce an Agentic Retrieval System that leverages large language models (LLMs) in a structured loop to retrieve relevant evidence with high precision and recall. Our framework consists of three specialized agents: a Question Analyzer that decomposes a multi-hop question into sub-questions, a Selector that identifies the most relevant context for each sub-question (focusing on precision), and an Adder that brings in any missing evidence (focusing on recall). The iterative interaction between Selector and Adder yields a compact yet comprehensive set of supporting passages. In particular, it achieves higher retrieval accuracy while filtering out distracting content, enabling downstream QA models to surpass full-context answer accuracy while relying on significantly less irrelevant information. Experiments on four multi-hop QA benchmarks -- HotpotQA, 2WikiMultiHopQA, MuSiQue, and MultiHopRAG -- demonstrates that our approach consistently outperforms strong baselines.


A Survey on Parallel Reasoning

Wang, Ziqi, Niu, Boye, Gao, Zipeng, Zheng, Zhi, Xu, Tong, Meng, Linghui, Li, Zhongli, Liu, Jing, Chen, Yilong, Zhu, Chen, Wu, Hua, Wang, Haifeng, Chen, Enhong

arXiv.org Artificial Intelligence

With the increasing capabilities of Large Language Models (LLMs), parallel reasoning has emerged as a new inference paradigm that enhances reasoning robustness by concurrently exploring multiple lines of thought before converging on a final answer. It has become a significant trend to explore parallel reasoning to overcome the fragility of standard sequential methods and improve practical performance. In this paper, we aim to survey and summarize the progress and challenges of parallel reasoning. We first present a formal definition of parallel reasoning and clarify its distinction from related concepts like Chain-of-Thought. Then, we organize and discuss advanced techniques based on a novel taxonomy, including non-interactive reasoning, interactive reasoning, and efficiency-focused decoding strategies. Additionally, we explore various application scenarios, such as solving complex problems and enhancing the reliability of LLM outputs.Finally, we highlight the core challenges of parallel reasoning and suggest potential directions for future research. We hope that our work can provide a useful roadmap for beginners and encourage more research on improving parallel reasoning methods. Related source can be avaliable in https://github.com/PPPP-kaqiu/Awesome-Parallel-Reasoning.


Beyond the limitation of a single query: Train your LLM for query expansion with Reinforcement Learning

Zhao, Shu, Yu, Tan, Xu, Anbang

arXiv.org Artificial Intelligence

Reasoning-augmented search agents, such as Search-R1, are trained to reason, search, and generate the final answer iteratively. Nevertheless, due to their limited capabilities in reasoning and search, their performance on multi-hop QA benchmarks remains far from satisfactory. To handle complex or compound queries, we train an LLM-based search agent with the native capability of query expansion through reinforcement learning. In each turn, our search agent proposes several query variants, which are searched simultaneously to cover more relevant information. Meanwhile, given limited post-training data and computing resources, it is very challenging for a search agent to master multiple tasks, including query generation, retrieved information understanding, and answer generation. Therefore, we propose incorporating a pre-trained squeezer model that helps the search agent understand the retrieved documents, allowing the search agent to focus on query generation for high retrieval recall. With the assistance of the squeezer model, we discover that even a small-scale 3B LLM can demonstrate a strong capability of query expansion and achieve state-of-the-art accuracy on the multi-hop QA benchmarks. To be specific, our experiments across seven question-answering benchmarks demonstrate that our method, named ExpandSearch, achieves an average improvement of 4.4% compared to state-of-the-art baselines, with strong gains on multi-hop reasoning tasks requiring diverse evidence aggregation.


RegexPSPACE: A Benchmark for Evaluating LLM Reasoning on PSPACE-complete Regex Problems

Jin, Hyundong, Hahn, Joonghyuk, Han, Yo-Sub

arXiv.org Artificial Intelligence

Large language models (LLMs) show strong performance across natural language processing (NLP), mathematical reasoning, and programming, and recent large reasoning models (LRMs) further emphasize explicit reasoning. Yet their computational limits, particularly spatial complexity constrained by finite context windows, remain poorly understood. While recent works often focus on problems within the NP complexity class, we push the boundary by introducing a novel benchmark grounded in two PSPACE-complete regular expression (regex) problems: equivalence decision (RegexEQ) and minimization (RegexMin). PSPACE-complete problems serve as a more rigorous standard for assessing computational capacity, as their solutions require massive search space exploration. We perform a double-exponential space exploration to construct a labeled dataset of over a million regex instances with a sound filtering process to build the benchmark. We conduct extensive evaluations on 6 LLMs and 5 LRMs of varying scales, revealing common failure patterns such as verbosity and repetition. With its well-defined structure and quantitative evaluation metrics, this work presents the first empirical investigation into the spatial computational limitations of LLMs and LRMs, offering a new framework for evaluating their advanced reasoning capabilities. Our code is available at https://github.com/hyundong98/RegexPSPACE .



Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs

Wang, Ziliang, An, Kang, Zheng, Xuhui, Qian, Faqiang, Zhang, Weikun, Ouyang, Cijun, Cai, Jialu, Wang, Yuhang, Wu, Yichao

arXiv.org Artificial Intelligence

While search-augmented large language models (LLMs) exhibit impressive capabilities, their reliability in complex multi-hop reasoning remains limited. This limitation arises from three fundamental challenges: decomposition errors, where tasks are incorrectly broken down; retrieval missing, where key evidence fails to be retrieved; and reasoning errors, where flawed logic propagates through the reasoning chain. A single failure in any of these stages can derail the final answer. We propose Erasable Reinforcement Learning (ERL), a novel framework that transforms fragile reasoning into a robust process. ERL explicitly identifies faulty steps, erases them, and regenerates reasoning in place, preventing defective logic from propagating through the reasoning chain. This targeted correction mechanism turns brittle reasoning into a more resilient process. Models trained with ERL, termed ESearch, achieve substantial improvements on HotpotQA, MuSiQue, 2Wiki, and Bamboogle, with the 3B model achieving +8.48% EM and +11.56% F1, and the 7B model achieving +5.38% EM and +7.22% F1 over previous state-of-the-art(SOTA) results. These findings suggest that erasable reinforcement learning provides a powerful paradigm shift for robust multi-step reasoning in LLMs.


TRUEBench: Can LLM Response Meet Real-world Constraints as Productivity Assistant?

Park, Jiho, Song, Jongyoon, Choi, Minjin, Heo, Kyuho, Huh, Taehun, Kim, Ji Won

arXiv.org Artificial Intelligence

Large language models (LLMs) are increasingly integral as productivity assistants, but existing benchmarks fall short in rigorously evaluating their real-world instruction-following capabilities. Current benchmarks often (i) lack sufficient multilinguality, (ii) fail to capture the implicit constraints inherent in user requests, and (iii) overlook the complexities of multi-turn dialogue. To address these critical gaps and provide a more realistic assessment, we introduce TRUEBench (Trustworthy Real-world Usage Evaluation Benchmark)1, a novel benchmark specifically designed for LLM-based productivity assistants. TRUEBench distinguishes itself by featuring input prompts across 12 languages, incorporating intra-instance multilingual instructions, employing rigorous evaluation criteria to capture both explicit and implicit constraints, and including complex multi-turn dialogue scenarios with both accumulating constraints and context switches. Furthermore, to ensure reliability in evaluation, we refined constraints using an LLM validator. Extensive experiments demonstrate that TRUEBench presents significantly greater challenges than existing benchmarks; for instance, a strong model like OpenAI o1 achieved only a 69.07% overall pass rate. TRUEBench offers a demanding and realistic assessment of LLMs in practical productivity settings, highlighting their capabilities and limitations.